ASYMPTOTICALLY MINIMAX BAYES PREDICTIVE DENSITIES

By Mihaela Aslan 

Yale University 

Given a random sample from a distribution with density function
that depends on an unknown parameter $\theta$, we are interested in
accurately estimating the true parametric density function at a future
observation from the same distribution. The asymptotic risk of
Bayes predictive density estimates with Kullback-Leibler loss function
$D(f_\theta\|\hat f) = \int f_\theta \log(f_\theta/\hat f)$ is used to examine various ways of
choosing prior distributions; the principal type of choice studied is
minimax. We seek asymptotically least favorable predictive densities
for which the corresponding asymptotic risk is minimax. A result
resembling Stein's paradox for estimating normal means by the maximum
likelihood holds for the uniform prior in the multivariate location
family case: when the dimensionality of the model is at least
three, the Jeffreys prior is minimax, though inadmissible. The Jeffreys
prior is both admissible and minimax for one- and two-dimensional
location problems.

1. Introduction. There has been a historical dispute between the clas- 
sical estimative density functions and the Bayesian predictive density func- 
tions in measuring the goodness-of-fit of the density estimate. For both the 
frequentist and Bayesian approaches to prediction inference, the choice of 
the prior is a serious matter either asymptotically or for finite samples. In 
this paper, we examine the asymptotic behavior of Bayes predictive density 
estimates under the Kullback-Leibler loss. These asymptotics are used to 
describe various ways of choosing prior distributions; the principal type of 
choice studied is minimax. Admissibility questions are also addressed for 
various families of densities. 

Suppose we are given a random sample $\mathbf{x}_n = (x_1, x_2, \ldots, x_n)$ of n independent,
identically distributed observations with respect to a probability density
$f_\theta(\cdot) = f(\cdot|\theta)$, $\theta \in \Theta \subset \mathcal{R}^p$, that depends on an unknown, p-dimensional
parameter. In the Bayesian approach, we assume some density $h(\cdot)$ over $\Theta$
to represent our prior knowledge of $\theta$. A future observation $x_{n+1}$ from the
same distribution is predicted by using a density $\hat f(\cdot|\mathbf{x}_n)$, which is called a
predictive density. We are interested in a density estimation problem where
the actual parameter to be estimated is the density at the next observation
$f(x_{n+1}|\theta)$, given the true, unknown parameter $\theta = (\theta_1, \theta_2, \ldots, \theta_p)$.

A natural loss function used to measure the distance between the two
densities for the next observation, $f_\theta$ and $\hat f$, is the Kullback-Leibler
divergence,

$$D(f_\theta\|\hat f) = \int \log\frac{f(x_{n+1}|\theta)}{\hat f(x_{n+1}|\mathbf{x}_n)}\, f(x_{n+1}|\theta)\,dx_{n+1},$$

which is positive unless $f(x_{n+1}|\theta)$ coincides with $\hat f(x_{n+1}|\mathbf{x}_n)$. This measure
depends on $\theta$ and the particular sample $\mathbf{x}_n$ observed. While not being a
distance due to lack of symmetry, the Kullback-Leibler divergence produces
standard results and consistent density estimates and, in general, leads to
a more tractable problem than other loss functions (the $L^1$ distance for
example). From the Bayesian point of view, the Kullback-Leibler loss has
historically been the main tool for obtaining noninformative priors; Jeffreys
[9] used its invariance properties and local behavior as a Euclidean square
of a distance function as a starting point in constructing and proposing the
prior that carries his name.

The Jeffreys prior density with respect to the p-dimensional Lebesgue
measure,

$$J(\theta) = \det{}^{1/2}\big((L_{i,j}(\theta))_{i,j=1,\ldots,p}\big),$$

where $(L_{i,j}(\theta))_{i,j=1,\ldots,p}$ is the information matrix $P_\theta[-\partial^2/\partial\theta_i\partial\theta_j \log f_\theta]$ and
$P_\theta$ represents the expectation with respect to $f_\theta$, plays an important role
in our framework. Inferences around the Jeffreys density are very suitable
here, especially in invariance-related problems. It is asymptotically least
favorable under entropy risk [5], and for $\alpha = \frac{1}{2}$, belongs to the family of
relatively invariant priors proposed by Hartigan [7],



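$$\frac{\partial \log h}{\partial\theta_j} = \sum_{i,r=1}^{p} L^{-1}_{i,r}\, P_\theta\left[\alpha\,\frac{\partial\log f_\theta}{\partial\theta_j}\frac{\partial\log f_\theta}{\partial\theta_i}\frac{\partial\log f_\theta}{\partial\theta_r} + \frac{\partial^2\log f_\theta}{\partial\theta_j\partial\theta_i}\frac{\partial\log f_\theta}{\partial\theta_r}\right].$$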



This family of prior densities is not equivalent to the family of all relatively
invariant priors and we will refer to it as the $\alpha$-family (or $\alpha$-class).
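
As a quick check that the Jeffreys prior is the $\alpha = \frac12$ member, one can differentiate $\log J$ directly. The sketch below is an illustration rather than part of the original argument. It assumes differentiation under the integral sign and uses the shorthand $l_i = \partial\log f_\theta/\partial\theta_i$, $L_{i,r} = P_\theta[l_i l_r]$, $L_{ij,r} = P_\theta[l_{ij} l_r]$ and $L_{i,j,r} = P_\theta[l_i l_j l_r]$, with the $\alpha$-family written as $\partial\log h/\partial\theta_j = \sum_{i,r} L^{-1}_{i,r}(\alpha L_{i,j,r} + L_{ij,r})$.

```latex
\begin{align*}
\frac{\partial}{\partial\theta_j}\log J
  &= \frac12\,\frac{\partial}{\partial\theta_j}\log\det\bigl(L_{i,r}\bigr)
   = \frac12\sum_{i,r} L^{-1}_{i,r}\,\frac{\partial L_{i,r}}{\partial\theta_j},\\[4pt]
\frac{\partial L_{i,r}}{\partial\theta_j}
  &= \frac{\partial}{\partial\theta_j}\int l_i\, l_r\, f_\theta \,dx
   = L_{ij,r} + L_{rj,i} + L_{i,j,r},\\[4pt]
\frac{\partial}{\partial\theta_j}\log J
  &= \sum_{i,r} L^{-1}_{i,r}\Bigl(L_{ij,r} + \tfrac12\, L_{i,j,r}\Bigr).
\end{align*}
```

The last line uses the symmetry of $L^{-1}_{i,r}$ to merge the two mixed terms, and it matches the $\alpha$-family expression exactly when $\alpha = \frac12$.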

The risk function is the expected Kullback-Leibler loss with respect to
$f_\theta$. We consider this to be our measure of the goodness-of-fit of $\hat f(x_{n+1}|\mathbf{x}_n)$
to the unknown $f(x_{n+1}|\theta)$:

$$R(\theta, \hat f) = P_\theta\big(D(f_\theta\|\hat f)\big) = \int\!\!\int \log\frac{f(x_{n+1}|\theta)}{\hat f(x_{n+1}|\mathbf{x}_n)}\, f(x_{n+1}|\theta)\, f(\mathbf{x}_n|\theta)\,dx_{n+1}\,d\mathbf{x}_n.$$

We also consider as our density estimate the Bayes predictive density for
the next observation based on the prior $h(\theta)$ and the data $\mathbf{x}_n$,

$$\hat f_h(x_{n+1}|\mathbf{x}_n) = \int f(x_{n+1}|\theta)\,h(\theta|\mathbf{x}_n)\,d\theta,$$

where $h(\theta|\mathbf{x}_n)$ is the posterior density obtained by using Bayes' product
formula,

$$h(\theta|\mathbf{x}_n) = \frac{h(\theta)\,f(\mathbf{x}_n|\theta)}{\int h(\theta)\,f(\mathbf{x}_n|\theta)\,d\theta}.$$

Thus, under the Kullback-Leibler loss, the predictive Bayes density estimate 
for the next observation is just the posterior density of the next observation. 
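
To make the definition concrete, the following sketch evaluates the Bayes predictive density for a Poisson sample in two ways: by integrating the model density against the posterior on a grid, and through the closed form that the integral reduces to. It is an illustration only. The Poisson model, the prior $h(\theta) = \theta^{a-1}$, the data values and all function names are assumptions chosen for the example; with this prior the posterior is a gamma density and the predictive density is negative binomial, a standard conjugacy fact.

```python
import math

def poisson_pmf(x, theta):
    """Poisson probability f(x | theta) with mean theta."""
    return math.exp(x * math.log(theta) - theta - math.lgamma(x + 1))

def predictive_numeric(x_next, data, a, upper=40.0, steps=40000):
    """Midpoint-rule version of  int f(x_next | theta) h(theta | data) dtheta."""
    n, s = len(data), a + sum(data)
    width = upper / steps
    num = den = 0.0
    for k in range(steps):
        theta = (k + 0.5) * width
        # Unnormalized posterior: theta^(a-1) * prod_i f(x_i | theta).
        post = math.exp((s - 1.0) * math.log(theta) - n * theta)
        num += poisson_pmf(x_next, theta) * post
        den += post
    return num / den

def predictive_closed_form(x_next, data, a):
    """Negative binomial predictive implied by the Gamma(a + sum, n) posterior."""
    n, s = len(data), a + sum(data)
    log_p = (math.lgamma(s + x_next) - math.lgamma(s) - math.lgamma(x_next + 1)
             + s * math.log(n / (n + 1.0)) - x_next * math.log(n + 1.0))
    return math.exp(log_p)

if __name__ == "__main__":
    sample = [2, 0, 3, 1]          # hypothetical data
    for y in range(4):
        print(y, predictive_numeric(y, sample, 0.5), predictive_closed_form(y, sample, 0.5))
```

The value `a = 0.5` corresponds to the Jeffreys prior $\theta^{-1/2}$ for the Poisson mean; other values of `a` can be substituted without changing the code.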

For samples of finite size, Aitchison [1] shows that when a specific prior
density $h(\theta)$ is given, any estimative density $\hat f(x_{n+1}|\mathbf{x}_n)$ is inferior, in
Kullback-Leibler risk, to the Bayes predictive density $\hat f_h(x_{n+1}|\mathbf{x}_n)$. From the
asymptotic point of view, Komaki [10] gives an asymptotic expression for the
Bayesian predictive distribution and shows that in the multidimensional
curved exponential family case, the estimative distributions, given
asymptotically efficient estimators, can be improved to predictive distributions
that asymptotically coincide with the Bayesian predictive distributions.
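
The gap between estimative and predictive densities can be seen exactly in the simplest case. The sketch below assumes a unit-variance normal location model with the uniform prior (an example chosen here, not one worked in the text): the plug-in density is $N(\bar x, 1)$, the Bayes predictive density is $N(\bar x, 1 + 1/n)$, and both risks follow from the closed-form Kullback-Leibler divergence between normal densities. Function names are hypothetical.

```python
import math

def kl_normal(mean_gap_sq, var_true, var_est):
    """D(N(m0, var_true) || N(m1, var_est)) where (m0 - m1)^2 = mean_gap_sq."""
    return 0.5 * (math.log(var_est / var_true)
                  + (var_true + mean_gap_sq) / var_est - 1.0)

def estimative_risk(n):
    # Plug-in density N(x_bar, 1). The loss is linear in the squared mean gap,
    # so its expectation only needs E(x_bar - theta)^2 = 1/n.
    return kl_normal(1.0 / n, 1.0, 1.0)

def predictive_risk(n):
    # Bayes predictive density under the uniform prior: N(x_bar, 1 + 1/n).
    return kl_normal(1.0 / n, 1.0, 1.0 + 1.0 / n)

if __name__ == "__main__":
    for n in (5, 50, 500):
        two_term = 0.5 / n - 0.25 / n ** 2   # p/(2n) - p/(4n^2) with p = 1
        print(n, estimative_risk(n), predictive_risk(n), two_term)
```

In this model the estimative risk is $\frac{1}{2n}$ and the predictive risk is $\frac12\log(1 + \frac1n) = \frac{1}{2n} - \frac{1}{4n^2} + O(n^{-3})$, so the two procedures share the first-order term and differ only at order $n^{-2}$.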

Following the program of Hartigan [8] for finding the maximum likelihood
prior density, we are searching for a prior distribution corresponding
asymptotically to the minimax risk, as the number of observations n from
$f_\theta$ increases to $\infty$. We use asymptotic expansions of Kullback-Leibler risk
functions in which the first order term $\frac{p}{2n}$ is the same for all estimative
and Bayes predictive densities, given any twice continuously differentiable
positive prior density; we allow prior densities to have infinite total mass
$\int_\Theta h(\theta)\,d\theta = \infty$.

Finding asymptotically minimax Bayes predictive density estimates is 
usually a hard task for general statistical settings, especially due to infinite 
parameter spaces. Choosing prior density functions for which the asymp- 
totic risk is minimax mainly reduces to solving very complicated differential 
equations in many dimensions, which may or may not have solutions. Even 
in noninvariant settings, by concentrating on a smaller class of priors with 
useful invariance properties (such as a class of relatively invariant priors), 
these differential equations become much simpler and we are sometimes able 
to arrive at minimax solutions. 

The main idea of this paper is to describe a searching algorithm for least
favorable priors which starts by looking for minimax solutions among
relatively invariant priors in the $\alpha$-class. We compare different predictive density
estimates by looking at the smaller order terms in the asymptotic risk. These
$\frac{1}{n^2}$ terms involve expressions in both the likelihood and the prior. Thus,
choosing one density estimate over another reduces mainly to choosing prior
density functions that improve on the asymptotic risk. Admissibility and
minimaxity questions are expressed in terms of certain differential operators;
the answers to these questions are then determined by the existence of
solutions to different partial differential equations.

Algorithm scheme. 
1. Compute asymptotic risk expressions of the form 



• A is the same constant for all estimative and Bayes predictive densities. 

• Different density estimates compete through the $\frac{1}{n^2}$ term, which
depends on the likelihood and the prior.

2. Find priors leading to asymptotically minimax density estimates: 

• Start by searching in a smaller class of priors and find those priors for 
which the asymptotic risk is constant in the parameter (in other words, 
find the optimum in a wide class of possible estimates). 

• Prove that the priors with the smallest constant risk are least favorable: 
show that they cannot be uniformly beaten over all priors by solving a 
differential equation in the parameter (in other words, show that this 
optimum is also the optimum among all possible estimates). 

This method is not restricted to relatively invariant priors in the $\alpha$-class or
to invariant statistical problems. The $\alpha$-class merely represents a good "set
of guesses" for an optimal estimate in the minimax sense. The methodology
can be generalized to various distribution functions and, hence, to general 
statistical settings which do not present any symmetries or invariance prop- 
erties. 

One important application of this method is to the general location model: 
a result resembling Stein's paradox for estimating normal means by the 
maximum likelihood holds for the uniform prior in the multivariate location 
family case. Using differential geometry and, in particular, potential theory, 
we show that when the dimensionality of the location model is at least three, 
Jeffreys' prior is minimax, though inadmissible. The Jeffreys prior is both 
admissible and minimax for one- and two-dimensional location problems. 

2. General notation and main result. We begin by introducing certain
notation that will be used throughout our discussion. Let

$$L(\theta) = \prod_{j=1}^{n} f_\theta(x_j) \quad\text{and}\quad l(\theta) = \sum_{j=1}^{n} \log f_\theta(x_j)$$

be the likelihood and the log-likelihood functions for the sample of
observations $\mathbf{x}_n$. Also, let

$$l_I = l_{i_1\cdots i_r} = \frac{\partial^r}{\partial\theta_{i_1}\cdots\partial\theta_{i_r}}\,l(\theta) = \sum_{j=1}^{n}\frac{\partial^r}{\partial\theta_{i_1}\cdots\partial\theta_{i_r}}\log f_\theta(x_j),$$

$$L_{I_1,I_2,\ldots,I_s} = P_\theta\big[\,l_{I_1}\,l_{I_2}\cdots l_{I_s}\big]$$

be the log-likelihood derivatives and the expectations of their products
(each $I_k$ denoting a block of indices), all
evaluated at the true value $\theta$; $P_\theta$ denotes expectation, given $\theta$. The same
quantities, evaluated at the maximum likelihood estimate $\hat\theta$, will be denoted
by $\hat L$, $\hat l$, $\hat l_I$ and $\hat L_{I_1,I_2,\ldots,I_s}$. The matrix $(-L_{ij})_{i,j=1,\ldots,p}$ is called the Fisher
information matrix.

The model is supplemented by a prior distribution on $\Theta$ with density
function h with respect to the Lebesgue measure, where $h_i = \frac{\partial}{\partial\theta_i}\log h$ and
$h_{ij} = \frac{\partial^2}{\partial\theta_i\partial\theta_j}\log h$ stand for the log prior first and second derivatives when
evaluated at the true $\theta$. Let $\hat h$, $\hat h_i$ and $\hat h_{ij}$ be the same quantities when
evaluated at the maximum likelihood estimate.

The following theorem gives the asymptotic expression for the Kullback-Leibler
risk up to terms of order $\frac{1}{n^2}$. Adopting tensor summation conventions,
the various expressions that appear in our formula are in fact sums of terms
over indices that appear twice.

Theorem 1. Under regularity conditions stated in the Appendix, the
asymptotic risk with terms of order $O(n^{-1})$ and $O(n^{-2})$, and ignoring smaller
terms of order $O(n^{-3})$, has the following expression:

R{eJh) = il-^ 



n 



13 1 

^i r L _■ c I —La r s -\ - La rs -\- Lirj ,s ~t~ tt-^ 



3 

+ LirjLfcst + —LijkL r ^ s t 

1 7 

+ -^Li r jL s kt + 7^LijkL rs t 

-1 r-1 
Here p is the dimensionality of $\Theta$.

Remark 1. $L^{-1}_{i,r}$ denotes the (i,r) element of the inverse of the Fisher
information matrix $(-L_{ir})_{i,r=1,\ldots,p}$.

Remark 2. The expression $n\{\cdots\}$ is the same for all n.

Remark 3. The first-order term in this expansion coincides with the
same order term in the asymptotic risks of the maximum likelihood and
Bayes procedures ([8], Theorems 1 and 4). It also coincides with the upper
bound for the asymptotic entropy risk from [5].

Remark 4. The prior expressions from the second order terms in the
asymptotic risks of the Bayes predictive densities and of the Bayes estimators
from [8] are the same. Thus, the difference in the $O(n^{-2})$ terms of the two
asymptotic risks does not depend on the prior choice. The estimative density
for the maximum likelihood estimate is $f(x|\hat\theta)$. It may be shown that the ratio
of predictive to estimative densities is asymptotically the same for all priors;
therefore, the difference in the Kullback-Leibler distances does not depend
on the prior and, indeed, the estimative density has risk no less than the
predictive density.

Remark 5. Let J denote the Jeffreys prior density. By adding and
subtracting terms involving log Jeffreys' first and second derivatives,
$J_i = \frac{\partial}{\partial\theta_i}\log J$ and $J_{ij} = \frac{\partial^2}{\partial\theta_i\partial\theta_j}\log J$, all evaluated at the true $\theta$, the $O(n^{-2})$ part
of the asymptotic risk expression will separate into two distinct expressions,
both invariant under monotone transformations of the parameter. We call
these new expressions the likelihood and the prior terms:

likelihood term 

— r s ( 2 ^J^'i 5 4^*j! rs ^- J it'j,s 

"f" ^ir^js^k \^\^ J hrjJ-'k,st + \LijkLt,rs + \^-'ijk^ J T,s,t 

-\- Lij.jLfc st -\- T^LijkLf st -\- ^LirjLskt "y^^-'iik^-'rst) 
~i~ L^ r L>j s ( 2^jr,j.s ~i~ 2Lij r s + Lij rs + Li r j s + 2-^i ; r,j,s) 
"f" ^ir ^j^s^k^t ( — ^LrsjLij.k ~ Lij ^L rs t — Lij ^Lrt s 

~\~ gLi jsLr fc t + 2l J rk,tI J i,j,s) 
+ ^i,r^j,s(^j,rs + L r j s )L k j(^Li^,t + Lik,t)', 
prior term

$$= L^{-1}_{i,r}\Big\{-J_r(h_i - J_i) + (h_{ir} - J_{ir}) + \frac12(h_i - J_i)(h_r - J_r)\Big\}.$$

Remark 6. If the group of invariant transformations of the parameter
is transitive (so that a transformation exists mapping any parameter value
into any other), then the likelihood term in the risk is constant. In this
case, for relatively invariant priors, the prior term is also constant. Thus, for
priors in the class of relatively invariant priors proposed by Hartigan [7],

$$\big\{h_i = L^{-1}_{j,s}\,(\alpha L_{i,j,s} + L_{ij,s})\big\}_\alpha,$$

the asymptotic risk expression is independent of the parameter and reduces
to a quadratic function in $\alpha$. Solving for $\alpha$, one finds that the choice of $\alpha$
giving the asymptotically minimum risk satisfies the following condition:

(a - ^L^rLj lL^Lij^Lr^j 

— ^i,r ^j,s ( ^ir,j,s ^ijt rs ^V'jijs) 

+ Lj tr LA S Lf. t (L r j s Li t k 7 t + 2L rSj feLjj ) t + L 1 . )St ] t Li t j ) t). 
The Jeffreys prior corresponds to $\alpha = \frac12$ and is a member of the class.

2.1. The one-parameter problem with examples. A simpler asymptotic
risk expression holds when $\theta$ is one-dimensional.

Corollary 1. For $\theta \in \Theta \subset \mathcal{R}$, the asymptotic risk expression becomes

R{ejh)= L~^ 



i 

+— 



n j^l I ( 2^1,1,2 + ^2,2 + £1,3 + 

+ L l,l ( L l,2 + g -^3^1,1,1 + 2^3^1,2 + J^Lt 

+ L^\(Li )2 + L 3 )hi + U[\ (hi + -hi 



2 



+ 0(n~ 3 ). 



Having precise expressions for the asymptotic risk permits detailed eval- 
uations of admissibility and minimaxity. In the following subsections, we 
present some applications to both discrete and continuous distribution func- 
tions. Although it is true that some of these examples can be done in finite 
sample settings, our general technique agrees with them and offers a method 
of arriving at minimax solutions. To simplify our calculations, we will assume 
that n = 1. 
2.1.1. The Poisson example. Let x be an observation according to the
Poisson distribution with mean $\theta$. From Corollary 1, the $O(\frac{1}{n^2})$ term in the asymptotic
risk that depends on the parameter is of the form

$$h_1 + \theta\Big(h_2 + \frac12 h_1^2\Big) - \frac{1}{12\theta},$$

where $h_1$ and $h_2$ are the first and second derivatives of $\log h$.
Within the class of relatively invariant priors $\{h = \theta^{\alpha-1}\}_\alpha$, the priors
$h = \theta^{\pm 1/\sqrt{6}}$ corresponding to $\alpha = 1 \pm \frac{1}{\sqrt{6}}$ have constant risk. The prior
corresponding to $\alpha = 1$ has risk everywhere smaller than these priors, so they are
inadmissible. However, the maximum risk for any prior is never less than the
risk for these priors, so they are minimax.
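
The search over the $\alpha$-class in this example is easy to trace numerically. The sketch below assumes that the parameter-dependent term is $h_1 + \theta(h_2 + \frac12 h_1^2) - \frac{1}{12\theta}$, with $\theta$ the Poisson mean and $h_1$, $h_2$ the first two derivatives of $\log h$; this is a reading of the expression for this example and should be treated as an assumption. The function name is hypothetical.

```python
import math

def poisson_second_order_term(theta, alpha):
    """Parameter-dependent O(1/n^2) term for the prior h = theta^(alpha - 1).

    Assumed form: h1 + theta * (h2 + h1^2 / 2) - 1 / (12 * theta).
    """
    h1 = (alpha - 1.0) / theta            # d/dtheta of log h
    h2 = -(alpha - 1.0) / theta ** 2      # second derivative of log h
    return h1 + theta * (h2 + 0.5 * h1 ** 2) - 1.0 / (12.0 * theta)

if __name__ == "__main__":
    root = 1.0 / math.sqrt(6.0)
    for theta in (0.3, 1.0, 7.0):
        print(theta,
              poisson_second_order_term(theta, 1.0 + root),   # vanishes for every theta
              poisson_second_order_term(theta, 1.0 - root),   # vanishes for every theta
              poisson_second_order_term(theta, 1.0),          # negative for every theta
              poisson_second_order_term(theta, 0.5))          # Jeffreys prior
```

For this family the term collapses to $\{(\alpha-1)^2/2 - 1/12\}/\theta$, which is why it vanishes identically at $\alpha = 1 \pm 1/\sqrt6$ and is negative at $\alpha = 1$.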

2.1.2. The binomial example. Let x be an observation according to the
binomial distribution with the canonical parameter $\theta$, $\mathrm{Bin}(1, \frac{e^\theta}{1+e^\theta})$. As in the
Poisson case, the $\frac{1}{n^2}$ term of the asymptotic risk depending on the parameter
is easily computed as being

$$\frac{1}{e^\theta}\Big\{\Big(h_2 + \frac12 h_1^2\Big)(1+e^\theta)^2 - h_1(1-e^{2\theta}) + \frac{5}{12} + \frac{1}{6}e^\theta + \frac{5}{12}e^{2\theta}\Big\}.$$

It can be shown that the prior corresponding to $\alpha = 1 + \frac{1}{\sqrt{6}}$ has constant
risk and is minimax among all positive priors.
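
The constant-risk claim can be checked in the same way as for the Poisson case. The sketch below rests on two assumptions that should be read as such: that the parameter-dependent term is $e^{-\theta}\{(h_2 + \frac12 h_1^2)(1+e^\theta)^2 - h_1(1-e^{2\theta}) + \frac{5}{12} + \frac16 e^\theta + \frac{5}{12}e^{2\theta}\}$, and that the $\alpha$-class member has log-derivative $h_1 = \alpha(1-e^\theta)/(1+e^\theta)$, that is, $h(\theta) \propto \{e^\theta/(1+e^\theta)^2\}^\alpha$. The function name is hypothetical.

```python
import math

def binomial_second_order_term(theta, alpha):
    """Assumed O(1/n^2) term for the alpha-class prior in the canonical parameter."""
    q = math.exp(theta)
    h1 = alpha * (1.0 - q) / (1.0 + q)        # d/dtheta of alpha * log(p * (1 - p))
    h2 = -2.0 * alpha * q / (1.0 + q) ** 2    # second derivative of the log prior
    bracket = ((h2 + 0.5 * h1 ** 2) * (1.0 + q) ** 2
               - h1 * (1.0 - q * q)
               + 5.0 / 12.0 + q / 6.0 + 5.0 * q * q / 12.0)
    return bracket / q

if __name__ == "__main__":
    root = 1.0 / math.sqrt(6.0)
    for alpha in (1.0 + root, 1.0 - root, 0.5):
        print(alpha, [binomial_second_order_term(t, alpha) for t in (-2.0, 0.0, 1.5)])
```

Under these assumptions the term equals $1 - 2\alpha + (\frac12\alpha^2 - \alpha + \frac{5}{12})(1-e^\theta)^2/e^\theta$, so it is constant in $\theta$ exactly at $\alpha = 1 \pm 1/\sqrt6$, and the larger root gives the smaller constant.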

2.1.3. The negative binomial example. For the negative binomial
distribution in the canonical parameter $\theta$, $\mathcal{NB}in(r, 1 - e^\theta)$, the $\frac{1}{n^2}$ term in the
asymptotic risk is of the form

$$\frac{1}{r e^\theta}\Big\{\Big(h_2 + \frac12 h_1^2\Big)(1-e^\theta)^2 - h_1(1-e^{2\theta}) + \frac{10}{24} - \frac{2}{12}e^\theta + \frac{10}{24}e^{2\theta}\Big\}.$$

Following the binomial case, the prior corresponding to $\alpha = 1 - \frac{1}{\sqrt{6}}$ gives least
constant risk within the $\alpha$-family; we have not been able to show minimaxity.

2.1.4. The normal location-scale example. Suppose we have an
observation x according to the normal location-scale density function, $\mathcal{N}(\mu, \sigma^2)$.
Due to obvious invariances with respect to groups of transformations over
the sample and the parameter spaces, the asymptotic risk expression reduces
to
R( ° 2 - w = I + + £ + ^ + ^ (*» + H + \ } + °<"- 3 >' 

where each subscript 2 for the prior represents differentiation with respect
to $\sigma^2$.

Unlike the normal scale case, the Jeffreys prior, $J(\mu, \sigma^2) \propto \sigma^{-3}$, is neither
admissible nor minimax: the prior corresponding to $\alpha = \frac23$ in the $\alpha$-class has
a strictly smaller asymptotic risk than Jeffreys' for all $\theta$. This agrees with
the finite sample result that the Bayes predictive density based on $\sigma^{-1}\,d\mu\,d\sigma$
(which corresponds to our $\alpha = \frac23$) is the best invariant predictive density and
has a strictly smaller asymptotic risk than the Jeffreys prior [12].



2.1.5. The multivariate normal scale example. Let $\mathbf{x}_n = (x_1, \ldots, x_n)$ be
a random sample according to the multivariate normal scale distribution
with the log-likelihood function

n 

l(V)=logf(x\V) = -l Yl WijXiXj- ±log|y|+ct, 

where V is a symmetric and positive definite covariance matrix, $V = W^{-1}$.
Also let 

kH) = khjl)-(irjr) = EfflT -pjTT—^OgfviXk), 

k = l UVl m ' ' ' UV lr]r 

L (iljl),(i2j 2 ),...,(i s j s ) = *V [l(hh)h2h) • • • Z (i S js)]> 

be the log-likelihood derivatives and the expectations of products of the
log-likelihood derivatives, where single indices represent pairs of indices
identifying the variance-covariance parameters. For example, $\mathrm{var}(X_i) = V_{ii} = W^{-1}_{ii}$,
$\mathrm{cov}(X_i, X_j) = V_{ij} = W^{-1}_{ij}$. For each pair of indices $(ii')$, it is assumed that
$i \le i'$.

We give, without proof, the following lemma that can be found in [2]: 

Lemma 1. For any pairs of indices $(ii')$, $(rr')$, $(jj')$, $(ss')$, the following
expressions for the expectations of different products of the log-likelihood
derivatives are true:

L(ii>) = 0, 

W ir Wi'r' + Wi r 'Wi> r 



L (ii')(rr') 2 | i= j/} + { r=r /} 



L (U')(rr')(jj') 



1 



2 {i=i'}+{r=r'}+{j=j'} 

X {Wi/jWr'iWj'r + WijWr/i/Wj'r + Wi' r Wj'iW r 'j 
+ WirWj'i'Wr'j + Wi'jWriWj'r' + Wi' r 'Wj'iW r j 
+ Wi'fWr'iWjr + Wi'rWjiWr'f}, 



L (H')(rr')(jj')(ss') 
1

~~ ~~ 2{i=i'}+{r=r'}+{j=j'}+{s=s>} 

$\times$ {sum of 48 terms like $W_{is}W_{i'j}W_{j'r}W_{r's'}$,

where, for each pair of W's, the indices must come from at least
three pairs of indices in the L's. For example, $W_{is}W_{i'j}$ comes
from the pairs $ii'$, $jj'$, $ss'$}.
Also, the inverse information matrix components are 

Using Remark 6 above, in the $\alpha$-class of priors, which is of the form

{ha* = aL^,^ £(ii'),(jj').( s <s')/c!' 

the choice of $\alpha$ giving the asymptotically minimum risk satisfies the following
condition:

= L ( ii'),(rr') L (jj'),(ss>) ( L (ii')(rr')(jj')(ss') 

(2.1) 

+ L (ii'),(rr') L (Jj'),(ss') ~ L (rr'),(ss') L (ii'),(jj')) 
+ L (ii'),(rr') L (j]'),(ss') L (kk')^tt') L (rr%ss')^ 

Through simple manipulations of the likelihood identities (A5) in the Appendix
and the formulae in Lemma 1, explicit expressions for all the terms in (2.1)
become available. Due to invariance arguments, it can be shown that the
$\alpha = \frac12$ solution to (2.1), which corresponds to the Jeffreys prior, has the
minimum asymptotic risk within the $\alpha$-class of priors.

In finite sample theory, Murray [13] and Ng [14] prove similar results for 
general group models under invariant prediction. 

For the univariate normal scale case, the Jeffreys prior is also minimax
among all smooth priors available: through a simple reparametrization of
the form $u = \log\sigma^2$, and following the argument in Section 3, this problem
becomes a location problem for which one can prove that the uniform prior
in the new parameter is least favorable among all smooth priors, and so is
minimax.

3. Minimaxity and admissibility in the location case. One important 
application of the searching method for minimax solutions is to the general 
location model. As in the Stein problem of estimating multivariate normal 
location parameters, we prove that a similar division between dimensions 2
and 3 holds for predictive density estimates for the general location problem.
For dimensions 1 and 2, the Jeffreys prior is admissible, but for dimensions 
greater than 2 there are priors that have everywhere smaller risk than Jef- 
freys' so that the Jeffreys prior, though minimax, is inadmissible. 

In finite sample theory, the Stein phenomenon in density estimation has 
been explored by Komaki [11] who showed that for the multivariate normal 
location model, the Jeffreys prior produces density estimates admissible in 1 
or 2 dimensions, but inadmissible in 3 or more, just as Stein did for location 
estimates. For the same multivariate normal location model, George, Liang 
and Xu [6] go further than Komaki and show that under certain conditions 
on the marginal of the prior, the corresponding Bayes predictive density 
becomes minimax. 

We prove a more general result by using the asymptotic risk expression 
from Theorem 1 to evaluate admissibility and minimaxity of density esti- 
mates in a general location model setting. The risk evaluations require the 
study of elliptic differential operators. For parameter estimation, such oper- 
ators, in a simple form, appear in [4] . We give here the general form of such 
operators for density estimates. 

Let $f(x - \mu)$ represent a general location model with the standard
probability density function $f(x)$ and $\mu$ the location parameter for the family. For
general multivariate location families, the likelihood term is constant under
invariant and transitive transformations of the parameter. Thus, the risk
expression depends on the parameter only through the prior term, which
has the expression

$$\sum_{i,r=1}^{p} L^{-1}_{i,r}\Big\{-J_r(h_i - J_i) + (h_{ir} - J_{ir}) + \frac12(h_i - J_i)(h_r - J_r)\Big\},$$

where h is any twice continuously differentiable positive prior among all
priors available and J is the Jeffreys prior. The Jeffreys prior being constant,
all of the log-Jeffreys derivatives are 0. Note that for h = J, the prior term
is 0.

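Choosing h of the form $g^2$, $g > 0$, so that $h_i = 2g_i/g$ and
$h_{ir} = 2g_{ir}/g - 2g_ig_r/g^2$ with $g_i = \partial g/\partial\mu_i$ and $g_{ir} = \partial^2 g/\partial\mu_i\partial\mu_r$, the part that remains
to be optimized becomes

$$\sum_{i,r=1}^{p} L^{-1}_{i,r}\Big(h_{ir} + \frac12 h_i h_r\Big) = 2\sum_{i,r=1}^{p} L^{-1}_{i,r}\,\frac{\partial^2 g/\partial\mu_i\partial\mu_r}{g} = \frac{2\,\Delta g}{g},$$

where $\Delta$ stands for the Laplacian differential operator and $\Delta g = \sum_{i=1}^{p} \frac{\partial^2}{\partial\mu_i^2}\,g$.
There exists a linear transformation on X that converts the information
matrix to the identity, which gives the last equality. Thus, the $L^{-1}_{i,r}$ factor is constant in the parameter.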
Theorem 2 (Admissibility). For p = 1 and p = 2 the Jeffreys prior is
admissible: there is no other prior g such that

$\Delta g \le 0$ for all $\mu$,

$\Delta g < 0$ for some $\mu$.

For $p \ge 3$, the Jeffreys prior is inadmissible: there exists a prior g such that
$\Delta g < 0$ for all $\mu$; however, there exists no prior g that dominates Jeffreys'
uniformly by a positive amount, that is, there exists no g such that for some
$c > 0$,

$$\frac{\Delta g}{g} \le -c < 0 \quad\text{for all } \mu.$$

Corollary 2 (Minimaxity). In one and two dimensions, the
admissibility of the Jeffreys prior supports its minimaxity by the constant asymptotic
risk. For location models of higher dimensionality, Jeffreys' is also minimax
because it cannot be dominated uniformly.

Proof of Theorem 2. The case p = 1 is a standard convex functions
result. The case p = 2 is simply Liouville's theorem (see [15]).

Case $p \ge 3$: If we assume that $\Delta g < 0$ everywhere, we can find a prior g
that makes the Jeffreys prior inadmissible. An example of this kind is

$$g(\mu) = \Big(1 + \sum_{i=1}^{p}\mu_i^2\Big)^{\alpha},$$

which, for $0 > \alpha > 1 - \frac{p}{2}$, satisfies the condition $\Delta g < 0$ for any $\mu$.
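
The behavior of this example can be checked numerically. The sketch below assumes the reading $g(\mu) = (1 + \sum_i \mu_i^2)^\alpha$, for which a direct calculation gives $\Delta g = 2\alpha(1+|\mu|^2)^{\alpha-2}\{p + (p + 2\alpha - 2)|\mu|^2\}$. It compares that closed form with central second differences; the function names and test points are hypothetical.

```python
import math

def g(mu, alpha):
    """Candidate prior factor g(mu) = (1 + |mu|^2)^alpha."""
    return (1.0 + sum(m * m for m in mu)) ** alpha

def laplacian_fd(mu, alpha, step=1e-3):
    """Laplacian of g by central second differences, summed over coordinates."""
    total = 0.0
    for i in range(len(mu)):
        up = list(mu)
        up[i] += step
        down = list(mu)
        down[i] -= step
        total += (g(up, alpha) - 2.0 * g(mu, alpha) + g(down, alpha)) / step ** 2
    return total

def laplacian_exact(mu, alpha):
    """Closed form 2*alpha*(1+r^2)^(alpha-2) * (p + (p + 2*alpha - 2)*r^2)."""
    p = len(mu)
    r2 = sum(m * m for m in mu)
    return 2.0 * alpha * (1.0 + r2) ** (alpha - 2.0) * (p + (p + 2.0 * alpha - 2.0) * r2)

if __name__ == "__main__":
    alpha = -0.25                     # lies in (1 - p/2, 0) when p = 3
    for point in ([0.0, 0.0, 0.0], [1.0, -2.0, 0.5], [30.0, 0.0, 0.0]):
        ratio = laplacian_exact(point, alpha) / g(point, alpha)
        print(point, laplacian_fd(point, alpha), laplacian_exact(point, alpha), ratio)
```

With p = 3 the Laplacian is negative at every point, while the ratio $\Delta g/g$ shrinks toward 0 as $|\mu|$ grows, in line with the statement that no prior improves on Jeffreys' by a uniform positive amount. With p = 2 and the same $\alpha$ the closed form turns positive far from the origin.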

Following Aslan [2], we consider as our domain in $\mathcal{R}^p$ a solid sphere D
with radius r and its surface $S: \{R = r\}$. We also consider the one-to-one
mapping to "cylindrical" coordinates

$$\mu \to (R, s),$$

with $|\mu|^2 \le R^2$, s being of dimension p − 1 and the Jacobian of the mapping
being

$$\Big|\frac{\partial\mu}{\partial(R,s)}\Big| = R^{p-1}.$$

Gauss' divergence theorem applies and

$$\int\cdots\int_S r^{p-1}\,\frac{\partial g}{\partial R}\,ds = \int\!\!\int\cdots\int_D R^{p-1}\,\Delta g\,dR\,ds,$$

or simply

$$r^{p-1}\,\bar g' = \int\!\!\int\cdots\int_D R^{p-1}\,\Delta g\,dR\,ds,$$

where $\bar g$ is the function obtained by averaging g over S, $\bar g(r) = \int\cdots\int_S g(R,s)\,ds$,
and $\bar g'$ is its derivative. By using the Leibniz rule for differentiation
and the hypothesized inequality $\Delta g \le -cg$ for a positive constant $c \in \mathcal{R}$, we
obtain

$$\frac{d}{dr}\big(r^{p-1}\,\bar g'\big) \le -c\,r^{p-1}\,\bar g.$$

Making the change of variable $u = r^{2-p}$ with $du = (2-p)\,r^{1-p}\,dr$ and
absorbing all the constants into c, we obtain the new differential inequality

(3.1) $\bar g'' \le -c\,u^{\lambda}\,\bar g \quad \forall u > 0,$

where -6 < A = 2fe=!2 < -3. 

— 2-p — 

Assume first that $\bar g'(u) < 0$ for all u. This implies that $\bar g$ is strictly
decreasing. Using the Taylor series expansion around $u_0 < u$ and (3.1), we
obtain

$$\bar g(u) \le \bar g(u_0) + (u - u_0)\,\bar g'(u_0) - \frac{c\,(u^*)^{\lambda}}{2}\,(u - u_0)^2\,\bar g(u^*),$$

where $u_0 < u^* < u$. Since $\bar g'(u_0) < 0$, $\bar g(u) \to -\infty$ as $u \to +\infty$, and this holds
for any $u_0 > 0$. Therefore, $\bar g'$ must be nonnegative for all u.

Similarly, using a Taylor series expansion around u for $u \in [0, t]$, t small,
and by the strict concavity of $\bar g$, together with (3.1) and the increasing
monotonicity of $\bar g$ in a small neighborhood of u, we obtain

$$\bar g\big(\tfrac12 u\big)\Big[1 + \tfrac18\,c\,u^{2+\lambda}\Big] \le \bar g(u).$$

Since $\lambda + 2 \le -1$, we have $u^{2+\lambda} \to \infty$ as $u \to 0$. Thus, for small u, $\bar g(\frac12 u) <
\frac12\,\bar g(u)$, which contradicts the strict concavity of $\bar g$. The Jeffreys prior is
inadmissible, but remains minimax for the case where the dimensionality of
the model is at least three. □

In general, for invariant problems where the likelihood term in the asymp- 
totic risk is constant, finding minimax solutions reduces to finding least fa- 
vorable priors for which the prior term is minimax. If a reparametrization 
of the problem exists in which the Fisher information matrix is the identity, 
then the same argument used in the location case shows that the Jeffreys 
prior is asymptotically minimax. 

APPENDIX: ASYMPTOTICS OF RISK 

We present here the main assumptions and ideas used in obtaining the 
result of Theorem 1. For a more elaborate and computationally involved 
presentation of the proof, see [2]. 

The asymptotics of the risk involve both Taylor series and Edgeworth
approximations which require appropriate regularity conditions. The work
of Bhattacharya and Ghosh [3] gives a rigorous account of the theory of
Edgeworth series for general statistics. The Taylor series approximations are
polynomials in $(\hat\theta - \theta)$ with remainder terms which require special attention
in order to integrate successfully. We also need expectations to evaluate risks
accurately to $O(n^{-2})$ terms.

The locally asymptotic normality of the standardized maximum likelihood
estimator $\sqrt{n}(\hat\theta - \theta)$, as well as truncated expectations in the sense given
by Hartigan [8], are used extensively here. The following assumptions, similar
to Hartigan's, are required for the validity of our expansions:

Assumptions. (A1) The prior density h is smooth in the sense that it
is twice continuously differentiable in a neighborhood of $\theta$ and is positive.
(A2) We assume that the second derivatives jLzl(Q) are of order n, where 

i 

n is the number of observations. We also assume that all l's and L's are, in
general, of order n with the exception of $l_i$'s, which are random variables of
order $\sqrt{n}$ with zero expectations. These assumptions are usually satisfied in
practice.

(A3) $l(\theta) = \log f(x|\theta)$ is five times continuously differentiable with
respect to $\theta$ in a small neighborhood of the true parameter $\theta$ for each
observation x.

(A4) All moments exist for the first four log-likelihood derivatives, and
for the maximum squared fifth derivatives, in a neighborhood of $\theta$; in other
words, for each $\theta \in \Theta$ and for some $\epsilon > 0$,

P g\K...M^)\ < °°i 
Pg( sup \lh,...,i 5 (0)\) <oo, 

\0-6\<e J 

where the set of indices $(i_1, \ldots, i_r) \subset \{1, \ldots, p\}^r$, with r = 4 and r = 5,
respectively.

(A5) The integral $\int f(x|\theta)\,dx$ can be differentiated four times with
respect to $\theta$ under the integral sign. The usual likelihood identities, obtained
by differentiating $\int f_\theta$, are valid for any indices i, j, k and l:

$$\begin{aligned}
0 &= L_i,\\
0 &= L_{ij} + L_{i,j},\\
0 &= L_{ijk} + L_{ij,k} + L_{jk,i} + L_{ik,j} + L_{i,j,k},\\
0 &= L_{ijkl} + L_{ijk,l} + L_{ijl,k} + L_{ikl,j}\\
  &\quad + L_{jkl,i} + L_{ij,kl} + L_{ik,jl} + L_{il,jk}\\
  &\quad + L_{ij,k,l} + L_{ik,j,l} + L_{il,j,k}\\
  &\quad + L_{jk,i,l} + L_{jl,i,k} + L_{kl,i,j} + L_{i,j,k,l}.
\end{aligned}$$

(A6) The Fisher information matrix $(-L_{ij})_{i,j=1,\ldots,p} = (L_{i,j})_{i,j=1,\ldots,p}$, or
$L_{i,j}$ for short, is nonsingular and positive definite for $|\hat\theta - \theta| < \epsilon$.
(A7) For each $\epsilon > 0$, $P\{|\hat\theta - \theta| > \epsilon\} = o(n^{-2})$.

Proof of Theorem 1. Through straightforward calculations the risk
expression becomes a difference between two Kullback-Leibler losses. When
$\theta$ is true, we have

(A.1) $R(\theta, \hat f_h) = D\big(f(\mathbf{x}_{n+1}|\theta)\,\|\,f(\mathbf{x}_{n+1})\big) - D\big(f(\mathbf{x}_n|\theta)\,\|\,f(\mathbf{x}_n)\big),$

where $f(\mathbf{x})$ stands for the marginal density of $\mathbf{x}$ and $\mathbf{x}_{n+1} = (x_1, \ldots, x_{n+1})$
is the sample augmented by the next observation.

To arrive at the risk asymptotics, we begin by computing an asymptotic
expression for the Kullback-Leibler loss, $D\big(f(\mathbf{x}|\theta)\,\|\,f(\mathbf{x})\big)$.

The following lemma gives the asymptotic behavior of $f(\mathbf{x})$. The result
and its proof can be found in [2].

Lemma 2. Under the previous regularity conditions, the marginal
density $f(\mathbf{x})$ has the following asymptotic expression with terms of order $O_p(n^{-1})$,
ignoring smaller $O_p(n^{-2})$ terms:



Using Lemma 2 and the "Expectation lemma" in [8], the Kullback-Leibler
loss expression becomes
Following Aslan [2], the first two integrals in the Kullback-Leibler expression
are further expanded into the following asymptotic expressions:

(log/(x|0)-log/(x|0))/(x|0)dx 
-- P - + L^L^(- l -L - - l -L -h - l -L 



1 . 1 



LijkL r>s t — —LirjLgkt — -^LijkL rs t J + Op(n 



and 

p 1 
Z + 2 



/log M!?L— /( x |0)dx 



^-logd^D-log/i 



L L 1 



~ La_rs . Li 



1 

T 



+ L^L^l ( -L rjjS hi - -L rjs hi j + -L^hir + P (n 



-2> 



By simply substituting these expressions into the Kullback-Leibler loss
formula from above, we obtain the following asymptotic approximation of
the loss:

Lemma 3. Under the previous regularity conditions, the asymptotic
expression for the Kullback-Leibler loss function $D\big(f(\mathbf{x}|\theta)\,\|\,f(\mathbf{x})\big)$ with terms
of order $O_p(n^{-1})$ and ignoring smaller terms of order $O_p(n^{-2})$ is as follows:

D(/(x|0)||/(x)) 

= -|log2vr - | + -\og(\L id \) - log/i 

+ L -iW -L - -L — L -h 

■Ir-lr-l/ !- - !- - 1 



+ Li r Lj S L k t ( ——Li r jLk s t — —Li jkLf rs — —LijkL r s t 
irj-L/k,st



irjJ-'skt 



12 



+ L i,r L j,l(~ L rj,s - L r j s )hi + L ir I —h 



'-hitlr 



+ P {n 



We now arrive at the asymptotic risk approximation in Theorem 1 by
simply substituting in (A.1) the two asymptotic Kullback-Leibler loss
expressions, written more concisely as

$$D_{n+1} = -\frac{p}{2}\log 2\pi - \frac{p}{2} + \frac12\log\{(n+1)^p\,|L_{i,j}|\} - \log h(\theta) - \frac{G(\theta)}{n+1} + O_p(n^{-2})$$

and

$$D_n = -\frac{p}{2}\log 2\pi - \frac{p}{2} + \frac12\log\{n^p\,|L_{i,j}|\} - \log h(\theta) - \frac{G(\theta)}{n} + O_p(n^{-2}),$$

where $D_{n+1} = D\big(f(\mathbf{x}_{n+1}|\theta)\,\|\,f(\mathbf{x}_{n+1})\big)$ and $D_n = D\big(f(\mathbf{x}_n|\theta)\,\|\,f(\mathbf{x}_n)\big)$. Thus,
the difference $D_{n+1} - D_n$ will be of the form

$$\frac{p}{2n} - \frac{p}{4n^2} + \frac{G(\theta)}{n^2} + O(n^{-3}),$$

where $G(\theta)$ is the $n\{\cdots\}$ term in the asymptotic risk expression of
Theorem 1. □